We consider the problem of model-based clustering in the presence of many correlated, mixed continuous and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential in determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder (ASD) on the basis of many cognitive and/or behavioral test scores. The data set contains a modest number of patients (~480) and many (~100) test score variables, many of which are discrete valued and/or missing. The goal of the work is to (i) cluster these patients into similar groups to help identify those with similar clinical presentation, and (ii) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably to other methods in simulations of problems of this type. The results of the ASD analysis suggested three clusters to be most likely, while only four test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.